Building & Preparing a RSQLite Database Using R

A step-by-step guide for constructing an RSQLite database using Airbnb data.

Nils Dosaj Mikkelsen, Jose Lama https://example.com/norajones
2022-03-27

Introduction

For this project, we demonstrate how to separate, clean and upload host and listing data taken from Airbnb for the city of Bangkok in Thailand.

The data for our project can be found here

Libraries

The following libraries are used for data cleaning and database construction

# Data cleaning libraries
library(readr)
library(tidyr)
library(stringr)
library(tibble)
library(dplyr) 

# Database libraries
library(DBI)
library(RSQLite)

# ER Diagram libraries
library(dm)

Data/Database Connection

Begin by loading the downloaded data into R using the read_csv() command from the readr library.

data <- read_csv("../data/listings.csv.gz")

We also need to form a database connection object:

con <- dbConnect(RSQLite::SQLite(), "airdb.SQLite")

Useful Function

While we are actively working on our database construction, we may find it useful to periodically disconnect and reconstruct our database. Having to close R and delete the airdb.SQLite file can be tedious, to get around this, we use the following remove_live_database() function to disconnect and current working database while R is still open:

remove_live_database <- function(con){
    if(file.exists("airdb.SQLite")){
        if (exists("con")){
            dbDisconnect(con)   
        }
        file.remove("airdb.SQLite")
    }
}

Data

The data used in for this project is provided by Airbnb and can be found here. To get started, first download the listings.csv.gz from the previous link. In order to help streamline the process of constructing our database, it is beneficial to split the .csv file into a listing table and a host_info table. View the E/R (Entity/Relationship) diagram in the next section, to see how these two tables relate.

From here, we will work on separating and cleaning the two respective tables prior to insertion in our database.

E/R Diagram

Initial Data

We can examine our initial data using the glimpse() command from the tibble library.

data %>% glimpse()
Rows: 17,040
Columns: 74
$ id                                           <dbl> 27934, 27979, 2~
$ listing_url                                  <chr> "https://www.ai~
$ scrape_id                                    <dbl> 2.021122e+13, 2~
$ last_scraped                                 <date> 2021-12-24, 20~
$ name                                         <chr> "Nice room with~
$ description                                  <chr> "Our cool and c~
$ neighborhood_overview                        <chr> "It is very cen~
$ picture_url                                  <chr> "https://a0.mus~
$ host_id                                      <dbl> 120437, 120541,~
$ host_url                                     <chr> "https://www.ai~
$ host_name                                    <chr> "Nuttee", "Emy"~
$ host_since                                   <date> 2010-05-08, 20~
$ host_location                                <chr> "Bangkok", "Ban~
$ host_about                                   <chr> "Hi All, I am n~
$ host_response_time                           <chr> "N/A", "N/A", "~
$ host_response_rate                           <chr> "N/A", "N/A", "~
$ host_acceptance_rate                         <chr> "N/A", "N/A", "~
$ host_is_superhost                            <lgl> FALSE, FALSE, F~
$ host_thumbnail_url                           <chr> "https://a0.mus~
$ host_picture_url                             <chr> "https://a0.mus~
$ host_neighbourhood                           <chr> "Victory Monume~
$ host_listings_count                          <dbl> 2, 2, 1, 1, 1, ~
$ host_total_listings_count                    <dbl> 2, 2, 1, 1, 1, ~
$ host_verifications                           <chr> "['email', 'pho~
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRU~
$ host_identity_verified                       <lgl> TRUE, FALSE, FA~
$ neighbourhood                                <chr> "Samsen Nai, Ba~
$ neighbourhood_cleansed                       <chr> "Ratchathewi", ~
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA,~
$ latitude                                     <dbl> 13.75983, 13.66~
$ longitude                                    <dbl> 100.5413, 100.6~
$ property_type                                <chr> "Entire condomi~
$ room_type                                    <chr> "Entire home/ap~
$ accommodates                                 <dbl> 3, 2, 2, 2, 2, ~
$ bathrooms                                    <lgl> NA, NA, NA, NA,~
$ bathrooms_text                               <chr> "1.5 baths", "1~
$ bedrooms                                     <dbl> 1, 1, 1, 1, 1, ~
$ beds                                         <dbl> 1, 2, 1, 1, 1, ~
$ amenities                                    <chr> "[\"Wifi\", \"H~
$ price                                        <chr> "$1,845.00", "$~
$ minimum_nights                               <dbl> 3, 1, 60, 5, 1,~
$ maximum_nights                               <dbl> 90, 730, 730, 3~
$ minimum_minimum_nights                       <dbl> 3, 1, 60, 5, 1,~
$ maximum_minimum_nights                       <dbl> 3, 1, 60, 5, 1,~
$ minimum_maximum_nights                       <dbl> 90, 730, 730, 3~
$ maximum_maximum_nights                       <dbl> 90, 730, 730, 3~
$ minimum_nights_avg_ntm                       <dbl> 3, 1, 60, 5, 1,~
$ maximum_nights_avg_ntm                       <dbl> 90, 730, 730, 3~
$ calendar_updated                             <lgl> NA, NA, NA, NA,~
$ has_availability                             <lgl> TRUE, TRUE, TRU~
$ availability_30                              <dbl> 27, 23, 29, 11,~
$ availability_60                              <dbl> 57, 53, 59, 11,~
$ availability_90                              <dbl> 87, 83, 89, 11,~
$ availability_365                             <dbl> 362, 358, 364, ~
$ calendar_last_scraped                        <date> 2021-12-24, 20~
$ number_of_reviews                            <dbl> 65, 0, 0, 1, 0,~
$ number_of_reviews_ltm                        <dbl> 0, 0, 0, 0, 0, ~
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 0, 0, ~
$ first_review                                 <date> 2012-04-07, NA~
$ last_review                                  <date> 2020-01-06, NA~
$ review_scores_rating                         <dbl> 4.85, NA, NA, 4~
$ review_scores_accuracy                       <dbl> 4.95, NA, NA, 5~
$ review_scores_cleanliness                    <dbl> 4.81, NA, NA, 3~
$ review_scores_checkin                        <dbl> 4.97, NA, NA, 2~
$ review_scores_communication                  <dbl> 4.91, NA, NA, 4~
$ review_scores_location                       <dbl> 4.66, NA, NA, 4~
$ review_scores_value                          <dbl> 4.75, NA, NA, 5~
$ license                                      <lgl> NA, NA, NA, NA,~
$ instant_bookable                             <lgl> FALSE, FALSE, F~
$ calculated_host_listings_count               <dbl> 2, 2, 1, 1, 1, ~
$ calculated_host_listings_count_entire_homes  <dbl> 2, 1, 0, 0, 0, ~
$ calculated_host_listings_count_private_rooms <dbl> 0, 1, 1, 1, 1, ~
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, ~
$ reviews_per_month                            <dbl> 0.55, NA, NA, 0~

Initial Preprocessing

Before we separate the data into two separate tables, we perform the following preprocessing steps:

  1. Convert all columns of the type date to the type character to help process NA values.
  2. Convert all of the different NA representations (blanks, None, N/A, NA) to NA.
  3. Convert all columns that contain dates back to the type data.
data <- data %>% 
        # (1) Convert dates to characters for NA values
        mutate(last_scraped = as.character(last_scraped),
               host_since = as.character(host_since),
               calendar_last_scraped = as.character(calendar_last_scraped),
               first_review = as.character(first_review),
               last_review = as.character(last_review),
               ) %>% 
        
        # (2) Homogenize NA values
        #*# Taken from: https://rpubs.com/Argaadya/create_table_sql
        mutate_all(function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x)) %>%  #*#
        # mutate_all(function(x) ifelse(is.na(x), "NULL", x)) %>% 
        
        # (3) Convert character strings back to date type
        mutate(last_scraped = as.Date(last_scraped),
               host_since = as.Date(host_since),
               calendar_last_scraped = as.Date(calendar_last_scraped),
               first_review = as.Date(first_review),
               last_review = as.Date(last_review))

Host Table

Since a host can have many listings, it’s beneficial to split our initial data into two tables before inserting into our database. The first table that we create, is the host_info table with the same attributes as shown in the E/R Diagram above.

Data Cleaning

We are now ready to extract and clean data from our initial data table in order to construct our host_data table. This is done and the following four steps:

  1. Extract the relevant columns from our initial data table. Note that we use the : syntax to grab many columns at once. This command syntax is inclusive.
  2. Remove duplicate rows using the distinct() function from the dplyr library.
  3. Convert the host_since column back to the type character. This is required since RSQLite does not support data of the type date. We can convert back to the correct type when performing queries.
  4. Can we use the str_remove_all() function from the stringr library to convert the host_verifications sublists into simple strings. e.g. “[‘email’, ‘phone’]” becomes “email, phone”.
    # (1) Extract host data 
    host_data <- data %>% 
        select(host_id:host_identity_verified, 
               calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
    
    
    # (2) Remove duplicate values
    host_data <- host_data %>% distinct()
    
    
    # (3) Convert dates
    # Note that this will need to converted back to type = date for analysis
    host_data  <- host_data %>% mutate(host_since = as.character(host_since)) 
    
    
    # (4) Clean host verification column
    host_data <- 
        host_data %>% 
        mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))

We can now view our clean data:

rmarkdown::paged_table(host_data)

Host Data Insertion

In order to facilitate the insertion of data into our RSQLite database, we the following insert_to_sql() function.

The insert_to_sql() function

This function performs a number of operations prior to insertion:

  1. Create a vector of the column names in the data argument.
  2. Makes sure that all NA out values are in the correct form. For all non NA values, replace the double quotes with single quotes and bookend strings containing multiple quotes with double quotes.
  3. Join all column data into a single string, transform NA values to NULL or data insertion and trim all additional whitespace.
  4. Prepend and append each string with a left and right parentheses respectively before joining all values into a single string and removing all \\’s. Finally, use the paste0() function to convert our string into a query prior to insertion in our database’s corresponding relevant table.
  5. Insert the now prepared data into the database.
insert_to_sql <- function(con, table, data){
    
    
    # (1)
    column_name <- paste(names(data), collapse = ", ")
    
    # (2)
    data_new <- data %>% 
        mutate_if(is.character, function(x) ifelse(is.na(x), NA,  x %>% 
                                str_replace_all('"', "'") %>% # Replace " with '
                                paste0('"', . , '"') # Add " before and after string
        )
        )
    
    value_data <- apply(data_new, MARGIN = 1,
                        function(x) x %>%
                            paste(collapse = ",") %>% # Join all column into single string
                            str_replace_all("\\bNA\\b", "NULL")  %>% # Create NULL from NA
                            str_trim() # remove unnecessary whitespace
    )
    # (4)
    query_value <- paste(value_data) %>% 
        paste0("(", ., ")") %>% # Add bracket before and after string
        paste(collapse = ", ") %>% # Join all values into single string
        str_remove_all("\\\\") %>% # Remove \\ from string
        paste0("INSERT INTO ", table, "(", column_name, ") VALUES ", .)
    
    # (5)
    dbSendQuery(con, query_value)
}

Host_info Table Creation

Now that our host_data table is clean. We can create the equivalent table as a query, initially as a string listing the table’s columns, before creating the empty table in our database using our con object:

#################### Create table for host info
    query <- "CREATE TABLE host_info(
        host_id INT, 
        host_url VARCHAR(50), 
        host_name VARCHAR(100), 
        host_since VARCHAR(50),
        host_location VARCHAR(500), 
        host_about VARCHAR(10000),
        host_response_time VARCHAR(50),
        host_response_rate VARCHAR(50),
        host_acceptance_rate VARCHAR(50),
        host_is_superhost BOOLEAN,
        host_thumbnail_url VARCHAR(500),
        host_picture_url VARCHAR(500),
        host_neighbourhood VARCHAR(50),
        host_listings_count INT,
        host_total_listings_count INT,
        host_verifications VARCHAR(500),
        host_has_profile_pic BOOLEAN,
        host_identity_verified BOOLEAN,
        calculated_host_listings_count INT, 
        calculated_host_listings_count_entire_homes INT,
        calculated_host_listings_count_private_rooms INT,
        calculated_host_listings_count_shared_rooms INT,
        PRIMARY KEY(host_id)
        )"

Data Insertion

Now create the empty table in our database.

dbSendQuery(con, query)
<SQLiteResult>
  SQL  CREATE TABLE host_info(
        host_id INT, 
        host_url VARCHAR(50), 
        host_name VARCHAR(100), 
        host_since VARCHAR(50),
        host_location VARCHAR(500), 
        host_about VARCHAR(10000),
        host_response_time VARCHAR(50),
        host_response_rate VARCHAR(50),
        host_acceptance_rate VARCHAR(50),
        host_is_superhost BOOLEAN,
        host_thumbnail_url VARCHAR(500),
        host_picture_url VARCHAR(500),
        host_neighbourhood VARCHAR(50),
        host_listings_count INT,
        host_total_listings_count INT,
        host_verifications VARCHAR(500),
        host_has_profile_pic BOOLEAN,
        host_identity_verified BOOLEAN,
        calculated_host_listings_count INT, 
        calculated_host_listings_count_entire_homes INT,
        calculated_host_listings_count_private_rooms INT,
        calculated_host_listings_count_shared_rooms INT,
        PRIMARY KEY(host_id)
        )
  ROWS Fetched: 0 [complete]
       Changed: 0

Schema Verification

Next, check the schema of our database so far.

res <- dbSendQuery(con, "PRAGMA table_info([host_info]);")
fetch(res)
   cid                                         name           type
1    0                                      host_id            INT
2    1                                     host_url    VARCHAR(50)
3    2                                    host_name   VARCHAR(100)
4    3                                   host_since    VARCHAR(50)
5    4                                host_location   VARCHAR(500)
6    5                                   host_about VARCHAR(10000)
7    6                           host_response_time    VARCHAR(50)
8    7                           host_response_rate    VARCHAR(50)
9    8                         host_acceptance_rate    VARCHAR(50)
10   9                            host_is_superhost        BOOLEAN
11  10                           host_thumbnail_url   VARCHAR(500)
12  11                             host_picture_url   VARCHAR(500)
13  12                           host_neighbourhood    VARCHAR(50)
14  13                          host_listings_count            INT
15  14                    host_total_listings_count            INT
16  15                           host_verifications   VARCHAR(500)
17  16                         host_has_profile_pic        BOOLEAN
18  17                       host_identity_verified        BOOLEAN
19  18               calculated_host_listings_count            INT
20  19  calculated_host_listings_count_entire_homes            INT
21  20 calculated_host_listings_count_private_rooms            INT
22  21  calculated_host_listings_count_shared_rooms            INT
   notnull dflt_value pk
1        0         NA  1
2        0         NA  0
3        0         NA  0
4        0         NA  0
5        0         NA  0
6        0         NA  0
7        0         NA  0
8        0         NA  0
9        0         NA  0
10       0         NA  0
11       0         NA  0
12       0         NA  0
13       0         NA  0
14       0         NA  0
15       0         NA  0
16       0         NA  0
17       0         NA  0
18       0         NA  0
19       0         NA  0
20       0         NA  0
21       0         NA  0
22       0         NA  0

Table Verification

Finally, insert our host_data table into the equivalent table in our RSQLite database using the insert_to_sql() function as defined above.

insert_to_sql(con, "host_info", host_data)

Database Verification

We can verify the contents of our newly created and populated RSQLite database table host_info as follows:

res <- dbSendQuery(con, "SELECT * FROM host_info LIMIT 10") 

out_db <- fetch(res) 
dbClearResult(res)

rmarkdown::paged_table(out_db)

Listing Table

Data Cleaning

Data Insertion

Database Verification

Conclusion